L18: Robust panel regression

Lecture overview

Remember from last lecture that we are using the following empirical question to showcase the statistical tools needed to run robust regression analysis:

Which of the following firm characteristics (if any) have statistically significant predictive power over firms’ profitability: the firm’s cash holdings, its book leverage or its capital investments?

In the previous lecture, we collected the data we needed for this analysis, produced some summary statistics and ran a basic linear regression where firm future profitability is the dependent variable, and firm cash holdings, book leverage, and investment are the explanatory variables. In this lecture, we continue this analysis by tackling two very common issues with linear regression analysis:

The potential presence of “fixed-effects” in the data
The issue of correlated error terms in the regression

The statsmodels package we used for the introductory regression materials does not implement some of the tools we will discuss in this lecture. So in these lecture notes, we will be using the linearmodels package, which can be installed by typing:

pip install linearmodels

in a Terminal or Anaconda Prompt. Once you have installed the package, import the PanelOLS class as below:

Preliminaries

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from linearmodels import PanelOLS

# Load data from last time
raw = pd.read_pickle('../data/comp_clean.zip')
raw.dtypes

permno          float64
datadate         object
ib              float64
at              float64
che             float64
dltt            float64
ppent           float64
sich            float64
year              int64
roa             float64
future_roa      float64
cash            float64
leverage        float64
investment      float64
w_future_roa    float64
w_cash          float64
w_leverage      float64
w_investment    float64
const             int64
dtype: object

# Make lists of variable names for convenience
yvar = 'w_future_roa'
xvars = ['const','w_cash', 'w_leverage','w_investment'] 
main_vars = [yvar] + xvars

# Keep only the data we need and set the index
comp = raw[['permno','year','sich'] + main_vars].copy()
comp['const'] = 1
comp = comp.set_index(['permno','year'])
comp

		sich	w_future_roa	const	w_cash	w_leverage	w_investment
permno	year
10000.0	1986	NaN	NaN	1	0.164539	0.027423	NaN
10001.0	1986	NaN	0.026506	1	0.060938	0.240647	NaN
	1987	4924.0	0.046187	1	0.061932	0.233625	-0.000170
	1988	4924.0	0.065069	1	0.063400	0.217725	-0.019599
	1989	4924.0	0.059901	1	0.063399	0.396984	0.332669
...	...	...	...	...	...	...	...
93436.0	2016	3711.0	-0.068448	1	0.154374	0.267113	0.361891
	2017	3711.0	-0.032821	1	0.122952	0.331046	0.190355
	2018	3711.0	-0.025125	1	0.130404	0.317894	-0.026913
	2019	3711.0	0.013826	1	0.189863	0.368038	0.014800
	2020	3711.0	NaN	1	0.376275	0.208790	0.060904

237017 rows × 6 columns

Endogeneity

We say that your regression may suffer from an endogeneity problem (or an endogeneity bias) if you suspect that the mean independence assumption (see assumption A2 in the regression intro lecture) is not satisfied, i.e. if you think that:

\[E[\epsilon_t | X] \neq0\]

There are many reasons why this issue might arise (look up “omitted variable bias”, “reverse causality bias”, and “measurement error bias” if you are interested in a deeper analysis). We will not go into each of these possible sources of endogeneity. Here, we only describe the two common ways to address endogeneity issues, and we implement only the latter.

Instrumental Variables (IV) estimation
- The main idea behind this approach is to find, for every endogenous variable X, another variable Z (called an “instrument”) which is correlated with X (aka the “relevance” condition), but does not affect the dependent variable in any way other than through its relation with X (aka the “validity” condition). The Z instrument is then used to extract the exogenous variation in X, which in turn is used in our main regression instead of X.
- This is a very general approach (it can be used regardless of what is causing the endogeneity issue) but it’s a bit too advanced to cover in this course. I will simply mention that the “linearmodels” package we use in this lecture can also run IV estimation using its “IV2SLS” class, and I’ll leave this for you to study at your own pace.
Fixed effects estimation
- This approach deals with the situation in which the endogeneity problem is caused by some unobservable, omitted variable that is constant either in the cross section or over time.
  - Example 1: firm fixed effects
    - It may be possible that the firm’s ROA is also determined by management quality (which we cannot measure easily). If high-quality managers, say, also like to hold a lot of cash, then the cash holdings variable is endogenous (in the equation above, cash holdings is part of X and management quality would be part of \(\epsilon\) since it affects ROA but is not part of our explanatory variables X). However, if management quality is relatively constant over time, we can control for its effects on ROA by demeaning the data at the firm level. This is what a firm fixed effects estimator does.
  - Example 2: time fixed effects
    - It may be the case that, in any given year, some macroeconomic shock affects both ROAs and cash holdings for all firms (e.g. a recession will decrease ROAs and increase cash holdings almost across the board). If this is the case, then cash holdings is again endogenous. But if the macroeconomic shock is (approximately) the same for all firms, then we can control for its effects (and fix our endogeneity problem) by demeaning the data at the year level. This is what a time fixed effects estimator does.
- Below, we show how to control for both firm and year fixed effects in our example application.
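To make the "demeaning" idea in Example 1 concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration (the variable names, the sample size, the true coefficient of 0.5 and the way "quality" enters the data); it is not the lecture's dataset. An unobserved, time-invariant firm trait drives both cash and ROA, so pooled OLS is biased, while OLS on firm-demeaned data recovers a slope close to the true one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_firms, n_years = 500, 10

# Hypothetical data: 'quality' is an unobserved, time-invariant firm trait
firm = np.repeat(np.arange(n_firms), n_years)
quality = np.repeat(rng.normal(size=n_firms), n_years)
cash = 0.8 * quality + rng.normal(size=n_firms * n_years)  # cash is correlated with quality
true_beta = 0.5
roa = true_beta * cash + 1.0 * quality + rng.normal(size=n_firms * n_years)

df = pd.DataFrame({'firm': firm, 'cash': cash, 'roa': roa})

def ols_slope(x, y):
    """Slope from a univariate OLS regression of y on x (with intercept)."""
    x = x - x.mean()
    return float((x * (y - y.mean())).sum() / (x ** 2).sum())

# Pooled OLS ignores quality, so the slope absorbs part of its effect
beta_pooled = ols_slope(df['cash'], df['roa'])

# Firm fixed effects: subtract each firm's own time-series mean, then run OLS
demeaned = df[['cash', 'roa']] - df.groupby('firm')[['cash', 'roa']].transform('mean')
beta_within = ols_slope(demeaned['cash'], demeaned['roa'])

print(f"true: {true_beta:.3f}, pooled: {beta_pooled:.3f}, within: {beta_within:.3f}")
```

The same logic applies to time fixed effects, with the group means taken by year instead of by firm.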

We will estimate fixed-effects regressions using the PanelOLS function that we imported above:

Abbreviated syntax:

PanelOLS(dependent, exog, entity_effects=False, time_effects=False, other_effects=None)

The first two arguments are where you tell the function what to use for the dependent variable and the independent variables, respectively. For firm fixed effects, you set entity_effects = True; for time fixed effects, you set time_effects = True; and for fixed effects at any other level (e.g. industry), you pass to other_effects the variable that determines which observation is in which group (e.g. an industry identifier for industry fixed effects). For entity_effects and time_effects, PanelOLS assumes that the first level of the index contains the firm identifier and the second level contains the time identifier (which is why we used set_index(['permno','year']) above).
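Since the index layout matters, it can help to check it before estimating anything. The sketch below uses a small made-up table (the values are hypothetical, only the column names follow the lecture) and plain pandas to confirm that the index has two levels in (firm, year) order and that no firm-year pair is repeated.

```python
import pandas as pd

# Hypothetical toy data with the same identifiers as in the lecture
toy = pd.DataFrame({
    'permno': [10001, 10001, 10002, 10002],
    'year':   [2019, 2020, 2019, 2020],
    'w_cash': [0.10, 0.12, 0.30, 0.28],
})

panel = toy.set_index(['permno', 'year'])   # entity first, time second

# Two index levels, in (entity, time) order
has_two_levels = panel.index.nlevels == 2
level_names = list(panel.index.names)

# Each firm-year pair should appear once; duplicates usually signal a data problem
n_duplicates = int(panel.index.duplicated().sum())

print(has_two_levels, level_names, n_duplicates)
```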

# Run basic regression again, for comparison (ignore warning about missing values if you get one)
results = PanelOLS(dependent = comp[yvar], 
                          exog = comp[xvars], 
                         ).fit();
print(results.summary)

C:\Users\ionmi\anaconda3\lib\site-packages\linearmodels\panel\data.py:98: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  if is_categorical(s):
C:\Users\ionmi\anaconda3\lib\site-packages\linearmodels\utility.py:549: MissingValueWarning: 
Inputs contain missing values. Dropping rows with missing observations.
  warnings.warn(missing_value_warning_msg, MissingValueWarning)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:           w_future_roa   R-squared:                        0.0955
Estimator:                   PanelOLS   R-squared (Between):              0.1031
No. Observations:              185315   R-squared (Within):              -0.0642
Date:                Fri, Feb 25 2022   R-squared (Overall):              0.0955
Time:                        14:36:53   Log-likelihood                    5498.7
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      6524.2
Entities:                       19139   P-value                           0.0000
Avg Obs:                       9.6826   Distribution:                F(3,185311)
Min Obs:                       1.0000                                           
Max Obs:                       80.000   F-statistic (robust):             6524.2
                                        P-value                           0.0000
Time periods:                      41   Distribution:                F(3,185311)
Avg Obs:                       4519.9                                           
Min Obs:                       4.0000                                           
Max Obs:                       6276.0                                           
                                                                                
                              Parameter Estimates                               
================================================================================
              Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
const            0.0314     0.0010     32.312     0.0000      0.0295      0.0334
w_cash          -0.3712     0.0028    -134.43     0.0000     -0.3766     -0.3658
w_leverage      -0.0727     0.0030    -23.971     0.0000     -0.0787     -0.0668
w_investment     0.1634     0.0064     25.736     0.0000      0.1510      0.1759
================================================================================

You can check that the above results are identical to the ones we obtained in the last lecture, using statsmodels.api.OLS. The PanelOLS function also tells us that we have 19,139 different entities (firms) in our sample, and 41 different time periods (years).

Firm fixed effects

results_firmfe = PanelOLS(dependent = comp[yvar], 
                          exog = comp[xvars], 
                          entity_effects = True
                         ).fit();
print(results_firmfe.summary)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:           w_future_roa   R-squared:                        0.0035
Estimator:                   PanelOLS   R-squared (Between):             -0.0776
No. Observations:              185315   R-squared (Within):               0.0035
Date:                Fri, Feb 25 2022   R-squared (Overall):             -0.0129
Time:                        14:36:53   Log-likelihood                 8.094e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      196.81
Entities:                       19139   P-value                           0.0000
Avg Obs:                       9.6826   Distribution:                F(3,166173)
Min Obs:                       1.0000                                           
Max Obs:                       80.000   F-statistic (robust):             196.81
                                        P-value                           0.0000
Time periods:                      41   Distribution:                F(3,166173)
Avg Obs:                       4519.9                                           
Min Obs:                       4.0000                                           
Max Obs:                       6276.0                                           
                                                                                
                              Parameter Estimates                               
================================================================================
              Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
const           -0.0357     0.0010    -34.112     0.0000     -0.0377     -0.0336
w_cash           0.0208     0.0038     5.4863     0.0000      0.0134      0.0283
w_leverage      -0.0557     0.0036    -15.452     0.0000     -0.0628     -0.0487
w_investment     0.0862     0.0050     17.261     0.0000      0.0764      0.0960
================================================================================

F-test for Poolability: 10.917
P-value: 0.0000
Distribution: F(19138,166173)

Included effects: Entity

The P-value under F-test for Poolability is very low, which tells us that the firm fixed effects are jointly statistically significant in our regression (i.e. we should keep them in our regression).
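The degrees of freedom reported above are consistent with a standard F-test that compares the regression with a single intercept to the one with a separate intercept per firm: 19138 = 19139 - 1 restrictions, and 166173 = 185315 - 19139 - 3 residual degrees of freedom. The sketch below computes that kind of statistic by hand on simulated data. The data, names and sizes are assumptions for illustration, and it is not claimed to reproduce the exact internals of linearmodels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_firms, n_years = 200, 8
firm = np.repeat(np.arange(n_firms), n_years)
alpha = np.repeat(rng.normal(size=n_firms), n_years)   # firm effects (hypothetical)
x = rng.normal(size=n_firms * n_years)
y = 0.3 * x + alpha + rng.normal(size=n_firms * n_years)
df = pd.DataFrame({'firm': firm, 'x': x, 'y': y})

def ssr(y, X):
    """Sum of squared residuals from an OLS regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

n, k = len(df), 1                                       # k = number of slope coefficients

# Restricted model: one common intercept
X_pooled = np.column_stack([np.ones(n), df['x'].to_numpy()])
ssr_pooled = ssr(df['y'].to_numpy(), X_pooled)

# Unrestricted model: firm-demeaned data (equivalent to one intercept per firm)
dm = df[['x', 'y']] - df.groupby('firm')[['x', 'y']].transform('mean')
ssr_fe = ssr(dm['y'].to_numpy(), dm[['x']].to_numpy())

df_num = n_firms - 1                                    # extra intercepts being tested
df_den = n - n_firms - k                                # residual degrees of freedom with FE
f_stat = ((ssr_pooled - ssr_fe) / df_num) / (ssr_fe / df_den)
print(f"F({df_num},{df_den}) = {f_stat:.2f}")
```

A large statistic means the firm intercepts explain much of the residual variation left by the pooled model, which is the sense in which the fixed effects are "jointly significant".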

Note how the coefficients have changed now that we have included firm fixed effects in our regression. In particular, note that the coefficient on w_cash has changed sign.

Time fixed effects

results_timefe = PanelOLS(dependent = comp[yvar], 
                          exog = comp[xvars], 
                          time_effects = True
                         ).fit();
print(results_timefe.summary)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:           w_future_roa   R-squared:                        0.0944
Estimator:                   PanelOLS   R-squared (Between):              0.1033
No. Observations:              185315   R-squared (Within):              -0.0645
Date:                Fri, Feb 25 2022   R-squared (Overall):              0.0955
Time:                        14:36:54   Log-likelihood                    6386.9
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      6436.1
Entities:                       19139   P-value                           0.0000
Avg Obs:                       9.6826   Distribution:                F(3,185271)
Min Obs:                       1.0000                                           
Max Obs:                       80.000   F-statistic (robust):             6436.1
                                        P-value                           0.0000
Time periods:                      41   Distribution:                F(3,185271)
Avg Obs:                       4519.9                                           
Min Obs:                       4.0000                                           
Max Obs:                       6276.0                                           
                                                                                
                              Parameter Estimates                               
================================================================================
              Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
const            0.0308     0.0010     31.563     0.0000      0.0289      0.0327
w_cash          -0.3710     0.0028    -133.37     0.0000     -0.3765     -0.3656
w_leverage      -0.0694     0.0030    -22.855     0.0000     -0.0754     -0.0634
w_investment     0.1677     0.0064     26.382     0.0000      0.1553      0.1802
================================================================================

F-test for Poolability: 44.613
P-value: 0.0000
Distribution: F(40,185271)

Included effects: Time

Once again, the P-value for the F-test for Poolability is very small, which means we should also keep the time fixed effects in our regression. Combined with the previous result, this means we should be including both firm and time fixed effects, which is what we do below.

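Note also that with time fixed effects alone, the coefficient on w_cash is negative again (-0.3710), almost identical to the estimate from the basic regression (-0.3712). The sign change we saw earlier is therefore driven by the firm fixed effects, not by the time fixed effects.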

Both firm and year fixed effects:

results_bothfe = PanelOLS(dependent = comp[yvar], 
                          exog = comp[xvars], 
                          entity_effects = True, time_effects = True,
                         ).fit();
print(results_bothfe.summary)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:           w_future_roa   R-squared:                        0.0029
Estimator:                   PanelOLS   R-squared (Between):             -0.0734
No. Observations:              185315   R-squared (Within):               0.0035
Date:                Fri, Feb 25 2022   R-squared (Overall):             -0.0097
Time:                        14:36:55   Log-likelihood                 8.184e+04
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      160.89
Entities:                       19139   P-value                           0.0000
Avg Obs:                       9.6826   Distribution:                F(3,166133)
Min Obs:                       1.0000                                           
Max Obs:                       80.000   F-statistic (robust):             160.89
                                        P-value                           0.0000
Time periods:                      41   Distribution:                F(3,166133)
Avg Obs:                       4519.9                                           
Min Obs:                       4.0000                                           
Max Obs:                       6276.0                                           
                                                                                
                              Parameter Estimates                               
================================================================================
              Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
const           -0.0358     0.0010    -34.326     0.0000     -0.0379     -0.0338
w_cash           0.0162     0.0038     4.2602     0.0000      0.0087      0.0236
w_leverage      -0.0499     0.0036    -13.775     0.0000     -0.0570     -0.0428
w_investment     0.0813     0.0050     16.210     0.0000      0.0715      0.0912
================================================================================

F-test for Poolability: 11.083
P-value: 0.0000
Distribution: F(19178,166133)

Included effects: Entity, Time

In this final specification, it seems like cash holdings are positively associated with future profitability.
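For intuition on what including both sets of fixed effects does to the data, here is a minimal sketch on a simulated balanced panel (all names and numbers are assumptions for illustration). In a balanced panel, removing firm means and year means and adding back the grand mean is equivalent to including both sets of dummies. The lecture's panel is unbalanced, where this one-step formula is not exact, so treat this as an illustration of the idea rather than of what PanelOLS does internally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_firms, n_years = 300, 10
firm = np.repeat(np.arange(n_firms), n_years)
year = np.tile(np.arange(n_years), n_firms)

# Hypothetical firm and year effects that also drive the regressor
firm_fe = rng.normal(size=n_firms)[firm]
year_fe = rng.normal(size=n_years)[year]
x = 0.7 * firm_fe + 0.7 * year_fe + rng.normal(size=firm.size)
true_beta = 0.2
y = true_beta * x + firm_fe + year_fe + rng.normal(size=firm.size)
df = pd.DataFrame({'firm': firm, 'year': year, 'x': x, 'y': y})

def two_way_demean(s, data):
    """Remove firm means and year means, add back the grand mean (balanced panels only)."""
    return (s - s.groupby(data['firm']).transform('mean')
              - s.groupby(data['year']).transform('mean')
              + s.mean())

x_dd = two_way_demean(df['x'], df)
y_dd = two_way_demean(df['y'], df)

# OLS slope on the doubly-demeaned data
beta_twoway = float((x_dd * y_dd).sum() / (x_dd ** 2).sum())
print(f"true: {true_beta:.3f}, two-way FE: {beta_twoway:.3f}")
```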

Sector fixed effects

comp['sic2d'] = comp['sich'].astype('string').str[0:2]
comp['sic2d'].value_counts()

73    18965
28    16678
36    13361
60    11046
38    10881
      ...  
76       86
81       34
86       11
90       11
89        8
Name: sic2d, Length: 69, dtype: Int64
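One caveat about the string slicing above, assuming 'sich' stores four-digit SIC codes as floats (which the dtypes listing suggests): a code below 1000, such as 100 (i.e. 0100), becomes the string '100.0', so its first two characters are '10' rather than '01'. The sketch below shows the difference on a few made-up codes and one possible way to zero-pad first. Whether this matters for the lecture's sample depends on whether such codes are present, and switching to the padded version could change the regression output shown in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical SIC codes stored as floats, as in the 'sich' column
sich = pd.Series([4924.0, 3711.0, 100.0, np.nan])

# Slicing the float's string form: 100.0 -> '100.0' -> '10'
naive = sich.astype('string').str[0:2]

# Zero-pad to four digits first: 100 -> '0100' -> '01'
padded = sich.astype('Int64').astype('string').str.zfill(4).str[0:2]

print(pd.DataFrame({'sich': sich, 'naive': naive, 'padded': padded}))
```

Missing codes stay missing in both versions.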

Note that ‘sic2d’ contains missing values, which will give us an error if we try to use them as fixed-effects. So we get rid of all missing values in our regression data, and store this in a new dataframe first:

df = comp[main_vars + ['sic2d']].dropna()

Now we can run our industry fixed-effects regression:

results_indfe = PanelOLS(dependent = df[yvar], 
                          exog = df[xvars], 
                          other_effects = df['sic2d']
                         ).fit();
print(results_indfe.summary)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:           w_future_roa   R-squared:                        0.0571
Estimator:                   PanelOLS   R-squared (Between):              0.0906
No. Observations:              153676   R-squared (Within):              -0.0440
Date:                Fri, Feb 25 2022   R-squared (Overall):              0.1012
Time:                        14:36:56   Log-likelihood                    1745.7
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      3102.9
Entities:                       23812   P-value                           0.0000
Avg Obs:                       6.4537   Distribution:                F(3,153604)
Min Obs:                       0.0000                                           
Max Obs:                       67.000   F-statistic (robust):             3102.9
                                        P-value                           0.0000
Time periods:                      41   Distribution:                F(3,153604)
Avg Obs:                       3748.2                                           
Min Obs:                       0.0000                                           
Max Obs:                       5811.0                                           
                                                                                
                              Parameter Estimates                               
================================================================================
              Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
const            0.0158     0.0012     13.445     0.0000      0.0135      0.0181
w_cash          -0.3090     0.0034    -91.633     0.0000     -0.3156     -0.3024
w_leverage      -0.0573     0.0036    -15.832     0.0000     -0.0644     -0.0502
w_investment     0.1727     0.0074     23.452     0.0000      0.1583      0.1871
================================================================================

F-test for Poolability: 85.603
P-value: 0.0000
Distribution: F(68,153604)

Included effects: Other Effect (sic2d)
Model includes 5 other effects
Other Effect Observations per group (sic2d):
Avg Obs: 2227.2, Min Obs: 4.0000, Max Obs: 1.44e+04, Groups: 69

Heteroskedasticity and correlated errors

We say that our regression may have a heteroskedasticity problem if we believe not all the residual terms in the regression (\(\epsilon\)’s) have the same variance, i.e. assumption A3 (see regression intro lecture) is not satisfied. As long as these error terms are not correlated with each other, then we can fix the heteroskedasticity problem by calculating “White” standard errors (as in the section below).

However, if we believe the residual terms may be correlated with each other (which again violates assumption A3), we have to be more explicit about the dimension in which these correlations occur. The most common cases are:

Residuals correlated within firm (i.e. the residuals of a single firm are correlated over time)
- We address this issue by specifying that our standard errors are “clustered” at the firm level
Residuals correlated within time (i.e. the residuals of all firms are correlated within a year)
- We address this issue by specifying that our standard errors are clustered at the year level (or month level for monthly frequency data, etc.)

To address the problems we highlighted above, we specify how we want our standard errors to be calculated by providing different parameters to the .fit() function:

Abbreviated syntax:

PanelOLS.fit(cov_type='unadjusted', debiased=False, auto_df=True, count_effects=True, **cov_config)

In particular, we will use cov_type = 'robust' if we are only worried about heteroskedasticity and want “White” standard errors. We will use cov_type = 'clustered' if we are worried about correlated residuals. In this case, we need to specify cluster_entity = True if we think residuals are correlated over time within each firm, and/or cluster_time = True if we think residuals might be correlated across firms within each time period.

We will use the model with firm and time fixed effects for the rest of this lecture. So we specify the regression model once below and reuse it, each time specifying a different way to fix the standard errors with the .fit() function. Note that none of these “fixes” will change the regression coefficients themselves, only their standard errors, and hence their statistical significance.

model = PanelOLS(comp[yvar], comp[xvars], entity_effects = True, time_effects = True)
model

PanelOLS 
Num exog: 4, Constant: True
Entity Effects: True, Time Effects: True, Num Other Effects: 0
id: 0x1f22fc75df0

White standard errors

results_white = model.fit(cov_type = 'robust');
print(results_white.summary)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:           w_future_roa   R-squared:                        0.0029
Estimator:                   PanelOLS   R-squared (Between):             -0.0734
No. Observations:              185315   R-squared (Within):               0.0035
Date:                Fri, Feb 25 2022   R-squared (Overall):             -0.0097
Time:                        14:36:58   Log-likelihood                 8.184e+04
Cov. Estimator:                Robust                                           
                                        F-statistic:                      160.89
Entities:                       19139   P-value                           0.0000
Avg Obs:                       9.6826   Distribution:                F(3,166133)
Min Obs:                       1.0000                                           
Max Obs:                       80.000   F-statistic (robust):             73.502
                                        P-value                           0.0000
Time periods:                      41   Distribution:                F(3,166133)
Avg Obs:                       4519.9                                           
Min Obs:                       4.0000                                           
Max Obs:                       6276.0                                           
                                                                                
                              Parameter Estimates                               
================================================================================
              Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
const           -0.0358     0.0016    -22.864     0.0000     -0.0389     -0.0328
w_cash           0.0162     0.0062     2.5922     0.0095      0.0039      0.0284
w_leverage      -0.0499     0.0051    -9.7033     0.0000     -0.0599     -0.0398
w_investment     0.0813     0.0074     11.032     0.0000      0.0669      0.0958
================================================================================

F-test for Poolability: 11.083
P-value: 0.0000
Distribution: F(19178,166133)

Included effects: Entity, Time
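To see where “White” standard errors come from, here is a minimal numpy sketch of the basic (HC0) sandwich formula on simulated heteroskedastic data. The data-generating process is an assumption for illustration, and linearmodels may apply additional small-sample adjustments, so this is not meant to reproduce its numbers exactly. The point is that the coefficients are the same under both formulas and only the standard errors differ.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
# Hypothetical heteroskedastic errors: the variance grows with |x|
eps = rng.normal(size=n) * (0.5 + np.abs(x))
y = 1.0 + 0.5 * x + eps

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical ("unadjusted") covariance: sigma^2 (X'X)^-1
sigma2 = resid @ resid / (n - X.shape[1])
se_classic = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) covariance: (X'X)^-1 [sum_i e_i^2 x_i x_i'] (X'X)^-1
meat = (X * resid[:, None] ** 2).T @ X
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("beta:      ", beta.round(4))
print("classic SE:", se_classic.round(4))
print("White SE:  ", se_white.round(4))
```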

Clustering standard errors at the firm level

results_firm_cluster = model.fit(cov_type = 'clustered', cluster_entity = True);
print(results_firm_cluster.summary)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:           w_future_roa   R-squared:                        0.0029
Estimator:                   PanelOLS   R-squared (Between):             -0.0734
No. Observations:              185315   R-squared (Within):               0.0035
Date:                Fri, Feb 25 2022   R-squared (Overall):             -0.0097
Time:                        14:37:00   Log-likelihood                 8.184e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      160.89
Entities:                       19139   P-value                           0.0000
Avg Obs:                       9.6826   Distribution:                F(3,166133)
Min Obs:                       1.0000                                           
Max Obs:                       80.000   F-statistic (robust):             47.063
                                        P-value                           0.0000
Time periods:                      41   Distribution:                F(3,166133)
Avg Obs:                       4519.9                                           
Min Obs:                       4.0000                                           
Max Obs:                       6276.0                                           
                                                                                
                              Parameter Estimates                               
================================================================================
              Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
const           -0.0358     0.0021    -17.288     0.0000     -0.0399     -0.0318
w_cash           0.0162     0.0089     1.8152     0.0695     -0.0013      0.0336
w_leverage      -0.0499     0.0070    -7.1564     0.0000     -0.0635     -0.0362
w_investment     0.0813     0.0083     9.8217     0.0000      0.0651      0.0976
================================================================================

F-test for Poolability: 11.083
P-value: 0.0000
Distribution: F(19178,166133)

Included effects: Entity, Time

Note how the cash holdings variable (which, last lecture, we thought had the highest predictive power over future profitability) is no longer statistically significant at the 5% level (its 95% confidence interval now includes zero).
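The reason clustering can widen standard errors this much is easier to see in a small simulation. The sketch below is an illustration under assumed data, not a reproduction of the linearmodels calculation: it uses a pooled regression without fixed effects for brevity and omits small-sample corrections. Both the regressor and the error contain a persistent firm component, so observations from the same firm carry less independent information than the White formula assumes.

```python
import numpy as np

rng = np.random.default_rng(4)
n_firms, n_years = 400, 10
firm = np.repeat(np.arange(n_firms), n_years)
n = firm.size

# Hypothetical data: regressor and error both contain a persistent firm component
x = rng.normal(size=n_firms)[firm] + 0.5 * rng.normal(size=n)
eps = rng.normal(size=n_firms)[firm] + 0.5 * rng.normal(size=n)
y = 0.5 * x + eps

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# White (HC0): treats every observation as independent
scores = X * resid[:, None]
se_white = np.sqrt(np.diag(XtX_inv @ (scores.T @ scores) @ XtX_inv))

# Clustered: sum the scores x_i * e_i within each firm before taking outer products
cluster_scores = np.zeros((n_firms, X.shape[1]))
np.add.at(cluster_scores, firm, scores)
meat_cluster = cluster_scores.T @ cluster_scores
se_cluster = np.sqrt(np.diag(XtX_inv @ meat_cluster @ XtX_inv))

print("beta:        ", beta.round(4))
print("White SE:    ", se_white.round(4))
print("clustered SE:", se_cluster.round(4))
```

Clustering by year follows the same recipe with the scores summed within each year instead of within each firm.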

Clustering standard errors at the year level

results_time_cluster = model.fit(cov_type = 'clustered', cluster_time = True);
print(results_time_cluster.summary)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:           w_future_roa   R-squared:                        0.0029
Estimator:                   PanelOLS   R-squared (Between):             -0.0734
No. Observations:              185315   R-squared (Within):               0.0035
Date:                Fri, Feb 25 2022   R-squared (Overall):             -0.0097
Time:                        14:37:01   Log-likelihood                 8.184e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      160.89
Entities:                       19139   P-value                           0.0000
Avg Obs:                       9.6826   Distribution:                F(3,166133)
Min Obs:                       1.0000                                           
Max Obs:                       80.000   F-statistic (robust):             33.385
                                        P-value                           0.0000
Time periods:                      41   Distribution:                F(3,166133)
Avg Obs:                       4519.9                                           
Min Obs:                       4.0000                                           
Max Obs:                       6276.0                                           
                                                                                
                              Parameter Estimates                               
================================================================================
              Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
const           -0.0358     0.0023    -15.705     0.0000     -0.0403     -0.0314
w_cash           0.0162     0.0093     1.7413     0.0816     -0.0020      0.0343
w_leverage      -0.0499     0.0070    -7.1542     0.0000     -0.0635     -0.0362
w_investment     0.0813     0.0112     7.2616     0.0000      0.0594      0.1033
================================================================================

F-test for Poolability: 11.083
P-value: 0.0000
Distribution: F(19178,166133)

Included effects: Entity, Time

Cluster at both the firm and year level

# Same model as before; cluster standard errors by firm (entity) and by year (time)
results_both_cluster = model.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)
print(results_both_cluster.summary)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:           w_future_roa   R-squared:                        0.0029
Estimator:                   PanelOLS   R-squared (Between):             -0.0734
No. Observations:              185315   R-squared (Within):               0.0035
Date:                Fri, Feb 25 2022   R-squared (Overall):             -0.0097
Time:                        14:37:03   Log-likelihood                 8.184e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      160.89
Entities:                       19139   P-value                           0.0000
Avg Obs:                       9.6826   Distribution:                F(3,166133)
Min Obs:                       1.0000                                           
Max Obs:                       80.000   F-statistic (robust):             25.269
                                        P-value                           0.0000
Time periods:                      41   Distribution:                F(3,166133)
Avg Obs:                       4519.9                                           
Min Obs:                       4.0000                                           
Max Obs:                       6276.0                                           
                                                                                
                              Parameter Estimates                               
================================================================================
              Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
const           -0.0358     0.0027    -13.498     0.0000     -0.0410     -0.0306
w_cash           0.0162     0.0112     1.4367     0.1508     -0.0059      0.0382
w_leverage      -0.0499     0.0084    -5.9289     0.0000     -0.0663     -0.0334
w_investment     0.0813     0.0118     6.8815     0.0000      0.0582      0.1045
================================================================================

F-test for Poolability: 11.083
P-value: 0.0000
Distribution: F(19178,166133)

Included effects: Entity, Time

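This is probably the specification I would choose going forward: it includes both firm and time fixed effects, and then adjusts the standard errors for any remaining correlation in the residuals along both of those dimensions. Note that the coefficient estimates are identical to the ones in the previous table; only the standard errors (and hence the t-statistics, p-values and confidence intervals) change.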
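To build some intuition for what clustering along two dimensions does, here is a minimal sketch of the usual two-way formula: the firm-clustered variance plus the year-clustered variance, minus the variance clustered on firm-year pairs. It uses a toy panel with made-up numbers and a single regressor without an intercept. All names here (ols_slope, cluster_var, two_way_var) are hypothetical helpers, not part of linearmodels, and the sketch ignores the small-sample corrections a package may apply, so do not expect it to reproduce the numbers in the table above.

```python
from collections import defaultdict


def ols_slope(x, y):
    """Slope of the no-intercept regression y = b*x + e."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def cluster_var(x, resid, groups):
    """Cluster-robust variance of the slope.

    The scores x*e are summed within each cluster before being squared,
    so any correlation of residuals inside a cluster is kept.
    """
    sxx = sum(xi * xi for xi in x)
    score = defaultdict(float)
    for xi, ei, g in zip(x, resid, groups):
        score[g] += xi * ei
    return sum(s * s for s in score.values()) / sxx ** 2


def two_way_var(x, resid, firm, year):
    """Firm-clustered + year-clustered - (firm, year)-clustered variance."""
    pair = list(zip(firm, year))
    return (cluster_var(x, resid, firm)
            + cluster_var(x, resid, year)
            - cluster_var(x, resid, pair))


# Toy panel (made-up numbers): 3 firms observed in 2 years
firm = [1, 1, 2, 2, 3, 3]
year = [2000, 2001, 2000, 2001, 2000, 2001]
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]
y = [0.9, 2.3, -1.2, 0.1, 3.4, -1.5]

b = ols_slope(x, y)
resid = [yi - b * xi for xi, yi in zip(x, y)]

v_firm = cluster_var(x, resid, firm)
v_year = cluster_var(x, resid, year)
v_pair = cluster_var(x, resid, list(zip(firm, year)))
v_two_way = two_way_var(x, resid, firm, year)

print("firm-clustered variance:", v_firm)
print("year-clustered variance:", v_year)
print("two-way variance:       ", v_two_way)
```

The subtraction is there because the firm-year pair terms would otherwise be counted twice, once in the firm sum and once in the year sum.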

Clustering standard errors at the industry level

We can also cluster standard errors along other dimensions of correlation. Below we cluster at the industry level, using the sic2d column as the cluster identifier.

results_ind_cluster = model.fit(cov_type='clustered', clusters=comp['sic2d'])
print(results_ind_cluster.summary)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:           w_future_roa   R-squared:                        0.0029
Estimator:                   PanelOLS   R-squared (Between):             -0.0734
No. Observations:              185315   R-squared (Within):               0.0035
Date:                Fri, Feb 25 2022   R-squared (Overall):             -0.0097
Time:                        14:37:06   Log-likelihood                 8.184e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      160.89
Entities:                       19139   P-value                           0.0000
Avg Obs:                       9.6826   Distribution:                F(3,166133)
Min Obs:                       1.0000                                           
Max Obs:                       80.000   F-statistic (robust):             41.229
                                        P-value                           0.0000
Time periods:                      41   Distribution:                F(3,166133)
Avg Obs:                       4519.9                                           
Min Obs:                       4.0000                                           
Max Obs:                       6276.0                                           
                                                                                
                              Parameter Estimates                               
================================================================================
              Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
--------------------------------------------------------------------------------
const           -0.0358     0.0065    -5.4738     0.0000     -0.0487     -0.0230
w_cash           0.0162     0.0277     0.5834     0.5596     -0.0381      0.0704
w_leverage      -0.0499     0.0114    -4.3621     0.0000     -0.0723     -0.0275
w_investment     0.0813     0.0089     9.1502     0.0000      0.0639      0.0987
================================================================================

F-test for Poolability: 11.083
P-value: 0.0000
Distribution: F(19178,166133)

Included effects: Entity, Time

Now the cash holdings variable is nowhere near statistically significant: its p-value is about 0.56, far above any conventional threshold (0.10, 0.05 or 0.01). The steps we went through in this lecture to make sure our results are trustworthy show that the conclusions of our analysis can change quite drastically once we take those steps. Using the simple regression specification from the last lecture, it seemed like the cash holdings variable was the strongest predictor of future profitability (with a negative coefficient). Now we see that the cash holdings variable is the only one that is NOT statistically significant in our regression.
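As a quick check on how the p-value for w_cash moves from table to table, the sketch below recomputes two-sided p-values from the t-statistics reported above. It assumes a standard normal approximation, which is reasonable here because the tables report about 166,000 denominator degrees of freedom. The helper name two_sided_p and the table labels are made up for this illustration.

```python
import math


def two_sided_p(t_stat):
    """Two-sided p-value for a t-statistic, using the standard normal
    approximation (appropriate when degrees of freedom are very large)."""
    return math.erfc(abs(t_stat) / math.sqrt(2))


# t-statistics for w_cash, copied from the three summary tables above
t_stats_cash = {
    "previous clustered fit": 1.7413,
    "clustered by firm and year": 1.4367,
    "clustered by industry": 0.5834,
}

for label, t in t_stats_cash.items():
    print(f"{label:28s} t = {t:6.4f}   p = {two_sided_p(t):.4f}")
```

The coefficient on w_cash is 0.0162 in all three tables, so the only thing driving the p-value up is the larger standard error in the denominator of the t-statistic.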